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ABSTRACT 

In  this  paper,  we  consider  facial  expression  recognition  using  an  unsupervised  learning  framework.  Specifically, 
given  a  data  set  composed  of  a  number  of  facial  images  of  the  same  subject  with  different  facial  expressions,  the 
algorithm  segments  the  data  set  into  groups  corresponding  to  different  facial  expressions.  Each  facial  image  can 
be  regarded  as  a  point  in  a  high- dimensional  space,  and  the  collection  of  images  of  the  same  subject  resides  on  a 
manifold  within  this  space.  We  show  that  different  facial  expressions  reside  on  distinct  subspaces  if  the  manifold 
is  unfolded.  In  particular,  semi-definite  embedding  is  used  to  reduce  the  dimensionality  and  unfold  the  manifold 
of  facial  images.  Next,  generalized  principal  component  analysis  is  used  to  fit  a  series  of  subspaces  to  the  data 
points  and  associate  each  data  point  to  a  subspace.  Data  points  that  belong  to  the  same  subspace  are  shown  to 
belong  to  the  same  facial  expression. 

Keywords:  Facial  expression  recognition,  unsupervised  learning,  dimension  reduction,  semi-definite  program¬ 
ming,  manifold  unfolding,  principal  component  analysis 

1.  INTRODUCTION 

The  human  face  is  a  rich  medium  through  which  people  communicate  their  emotions.  Researchers  have  identified 
six  basic  human  expressions,  namely,  happiness,  sadness,  anger,  disgust,  fear,  and  surprise.1  Automatic  facial 
expression  recognition  algorithms  can  be  used  in  systems  involving  human-computer  interaction.  An  emerging 
field  of  application  for  facial  expression  recognition  algorithms  involves  clinical  decision  support  systems.3,4 
Specifically,  the  authors  in  Refs.  5  and  6  present  a  framework  for  assessing  pain  and  pain  intensity  in  neonates 
using  digital  imaging. 

Among  different  approaches  proposed  for  facial  expression  recognition  are  manifold-based  methods.7  In  these 
methods,  the  facial  image  can  be  regarded  as  a  point  in  a  D-dimensional  space  (which  is  referred  to  as  the 
ambient  space),  where  D  is  the  number  of  pixels  in  the  image  or  the  number  of  parameters  in  a  face  model. 
The  underlying  assumption  in  manifold-based  methods  is  that  a  set  of  facial  images  of  a  subject,  which  are 
represented  by  a  set  of  points  in  a  high-dimensional  ambient  space,  resides  on  an  intrinsically  low-dimensional 
manifold.  Hence,  an  important  part  of  the  facial  expression  recognition  algorithm  in  such  methods  involve  finding 
the  manifold  of  facial  expressions. 

In  this  paper,  we  propose  an  unsupervised  learning  approach  to  facial  expression  recognition,  where  we 
show  that  different  facial  expressions  reside  on  distinct  subspaces  if  the  manifold  of  facial  images  is  unfolded.8 
Specifically,  we  introduce  a  new  manifold-based  method,  where  we  use  a  maximum  variance  unfolding  (MVU) 
approach8  to  identify  the  low-dimensional  manifold  of  facial  images  and  unfold  it.  Next,  generalized  principal 
component  analysis  is  used  to  fit  a  series  of  subspaces  to  the  data  points  and  associate  each  data  point  to  a 
subspace.  Data  points  that  belong  to  the  same  subspace  are  shown  to  belong  to  the  same  facial  expression. 
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The  contents  of  the  paper  are  as  follows.  First,  we  review  the  MVU  dimension  reduction  technique,  which 
involves  semi-definite  programming  and  convex  optimization.  In  Section  3,  we  review  the  generalized  principal 
component  analysis  (GPCA).  This  framework  uses  algebro-geometric  concepts  to  address  the  problem  of  data 
segmentation  and  subspace  identification  for  a  given  set  of  data  points.  In  Section  4,  the  MVU  and  GPCA 
methods  are  used  to  recognize  facial  expressions  from  a  given  set  of  images  within  an  unsupervised  learning 
framework.  Finally,  we  draw  conclusions  in  Section  5. 

The  notation  used  in  this  paper  is  fairly  standard.  Specifically,  Z+  denotes  the  set  of  positive  integers,  R 
denotes  the  set  of  real  numbers,  (-)T  denotes  transpose,  and  (•)!  denotes  the  Moore-Penrose  generalized  inverse. 
Furthermore,  we  write  tr(-)  for  the  trace  operator,  Af(-)  for  null  space,  ||  •  ||  for  the  Euclidean  norm,  and  dim(<S) 
for  the  dimension  of  a  set  S  C  R13,  where  D  €  Z+. 

2.  MANIFOLD  UNFOLDING  AND  DIMENSION  REDUCTION 

In  this  section,  we  introduce  the  method  of  maximum  variance  unfolding  (MVU),  which  involves  a  dimension 
reduction  technique  that  uses  semi-definite  programming  and  convex  optimization.  Given  a  set  of  points  sampled 
from  a  low-dimensional  manifold  in  a  high-dimensional  ambient  space,  this  technique  unfolds  the  manifold  (and 
hence,  the  points  it  contains)  while  preserving  the  local  geometrical  properties  of  the  manifold.8  This  method 
can  be  regarded  as  a  nonlinear  generalization  of  the  principal  component  analysis  (PCA).8 

Given  a  set  of  points  in  a  high-dimensional  ambient  space,  principal  component  analysis  identifies  a  low¬ 
dimensional  subspace  such  that  the  variance  of  the  projection  of  the  points  on  this  subspace  is  maximized. 
More  specifically,  the  basis  of  the  subspace  on  which  the  projection  of  the  points  has  the  maximum  variance 
is  the  eigenvectors  corresponding  to  the  non-zero  eigenvalues  of  the  covariance  matrix.9  In  the  case  where  the 
data  is  noisy,  the  singular  vectors  corresponding  to  the  dominant  singular  values  of  the  covariance  matrix  are 
selected.10,1"1 

Given  the  set  of  N  input  points  X  =  {x\,  X2,  ■  ■  ■ ,  a; at}  C  R13,  where  D  is  the  dimension  of  the  ambient  space, 
we  seek  to  find  the  set  of  N  output  points  y  =  {yi,  y2,  •  •  • ,  Vn}  C  Rd  such  that  d  <  D,  X  and  y  are  equivalent, 
and  points  sufficiently  close  to  each  other  in  the  input  data  set  X  remain  sufficiently  close  in  the  output  data  set 
y.  Recall  that  two  sets  X  and  y  are  equivalent  if  and  only  if  there  exists  a  bijective  (one-to-one  and  onto)  map 
/  :  X  — >  y.  To  address  this  problem,  the  concept  of  isometry  for  a  set  of  points  is  needed.8’12  In  particular, 
an  isometry  is  a  diffeomorphism  defined  on  a  manifold  such  that  it  admits  a  local  translation  and  rotation.  The 
next  definition  extends  the  notion  of  isometry  to  data  sets. 

Definition  2.1. 8  Let  X  =  {xi,  x2,...,  xn}  C  R13  and  y  =  {y\,  j/2,  •  •  • ,  Vn}  C  Rd  be  equivalent.  Then  X 
and  y  are  k- locally  isometric  if  there  exists  a  continuous  map  T  :  S.D  — >  Rd  such  that  if  T(xi )  =  yi,  then 
T{NXi(k))  =  NVi(k),  i  =  where  NXi(k )  (resp.,  NVi{k))  is  the  set  of  k -nearest  neighbors  of  Xi  €  X 

(resp.,  yi  e  y). 

Before  stating  the  MVU  method,  we  introduce  the  following  maximization  problem. 

Maximum  Variance  Unfolding  Problem.  Given  a  set  of  input  data  points  X  =  {x\,  x2, . . . ,  Xn}  C  R13 
find  the  set  of  output  data  points  y  =  {y\,  y2, . . . ,  j/jv}  C  Rd,  where  d  <  D,  such  that  the  sum  of  pairwise  square 
distances  between  the  outputs  in  y  given  by 


1  N  N 
i=i  j= l 

is  maximized,  and  X  and  y  are  fc-locallv  isometric  for  some  k  G  Z+. 

Without  loss  of  generality,  we  assume  that  YliLi  xi  =  0-  Moreover,  we  require  Vn  =  0  to  remove  the 

translational  degree  of  freedom  in  the  output  points  contained  in  y.  Note  that  a  data  set  (e.g.,  X)  can  be 
represented  by  a  weighted  graph  0,13  where  each  node  represents  a  point  and  the  k- nearest  points  are  connected 
by  edges,  where  k  is  a  given  parameter.  The  weights  of  ©  represent  the  distance  between  the  nodes.  In  addition, 
we  assume  that  the  graph  ©  is  connected.13  In  the  case  of  a  disconnected  graph,  each  connected  component 
should  be  analyzed  separately.  The  fc-local  isometry  condition  in  the  Maximum  Variance  Unfolding  Problem 


Figure  1.  Original  and  modified  graphs  for  k  =  2. 

requires  that  the  distances  and  the  angles  between  the  fc-nearest  neighbors  are  preserved.  This  constraint  is 
equivalent  to  preserving  the  distances  between  neighboring  points  in  a  modified  graph  ©',  where,  for  each  node, 
all  the  neighboring  nodes  of  ©'  are  connected  by  an  edge.  More  precisely,  each  node  of  ©'  and  the  /c-neighboring 
nodes  of  ©'  form  a  clique  of  size  k  +  1  (see  Figure  l).13 

The  next  theorem  gives  the  solution  to  Maximum  Variance  Unfolding  Problem  for  the  case  where  d  =  D. 

Theorem  2. 2. 8  Consider  the  Maximum  Variance  Unfolding  Problem  with  d  =  D.  The  output  data  points  in 
y  =  { j/i ,  7/2, ... ,  Vn}  C  S.d  are  given  by  the  solution  to  the  optimization  problem 

max  <f>,  (2) 

9l,S2,-.,W6RD 

subject  to 

N 

=  °>  (3) 

2=1 

hi-yjW2  =  Dij ,  */  V(i,j)  =  !>  i,j  =  !,•••,  N,  (4) 

where  $  is  defined  in  (1),  ?/  =  [r](ij)]  €  M.NxN  is  the  adjacency  matrix  of  the  modified  graph  & ,  and 

Dij  —  || Xi  Xj  ||  ,  7,  j  1,  .  .  .  ,  N ,  Xi ,  Xj  £  (5) 


The  optimization  problem  (2)-(4)  is  not  convex.  The  following  convex  optimization  problem,  however,  is 
equivalent  to  the  optimization  problem  given  in  Theorem  2.2.  Moreover,  the  following  result  also  addresses  the 
case  where  d  <  D. 

Theorem  2. 3. 8  Consider  the  Maximum  Variance  Unfolding  Problem  with  d  =  D  and  let  ©  and  &  denote 
the  weighted  graph  and  modified  graph  corresponding  to  the  data  set  X ,  respectively.  The  output  data  points  in 
y  =  {yi,  2/2,  •  •  • ,  Vn}  C  Rd  are  given  by 

Vji  =  j  =  l,...,N,  i  =  l,...,D,  (6) 

where  Vji,  j  =  1,  i  =  1  is  the  ith  component  of  the  jth  eigenvector  of  K*  given  by  Vj  = 

[Vji,  Vj2,  •••,  VjD]T,  A  j  is  the  associated  eigenvalue,  yji,  j  =  1  i  =  1  is  the  ith  component 

of  the  vector  yj  =  [yji,  yj2,  ■  ■  ■ ,  2/j\d]T,  and  K*  is  the  optimal  solution  to  the  optimization  problem 

maxtr(A'),  (7) 

k  gk 


subject  to 


K  >  0 


(8) 

(9) 


N  N 

EE%>  =  0 

*=  1  3=  1 

+  K(jJ)  Dij,  if  T] (ij)  15  ij  j  1?  •  •  •  5  N,  (10) 

where  IK  C  RNxN  denotes  the  cone  of  nonnegative  definite  matrices,  r)  =  [V(i,j)]  €  M.NxN ,  and  Dij  is  defined  as 
in  (5).  Moreover,  if  I\*  has  d  <  D  non-zero  eigenvalues,  then  the  set  of  reduced  dimension  output  data  points 
is  given  by  y  =  {j/ied,  y™*1,  ■  ■  ■ ,  2/jvd }  C  where  y\eA ,  i  =  1, . . . ,  N ,  is  found  by  removing  the  zero  elements  of 
Vi- 

Remark  2.4.  When  the  data  is  noisy,  all  the  eigenvalues  of  K  are  typically  non-zero,  and  hence,  one  has  to 
choose  the  dominant  eigenvalues  of  I\* .10, 11  If  the  eigenvalues  of  K  are  sorted  in  descending  order,  then  the 
first  d  components  of  yi,  i  =  1, . . . ,  N ,  form  a  d-dimensional  data  set  that  is  approximately  k-locally  isometric 
to  X  =  {xi,  X2,  .  .  .  ,  Xn}  C  R0.8 

3.  DATA  SEGMENTATION  AND  SUBSPACE  IDENTIFICATION 

In  this  section,  we  address  the  problem  of  data  segmentation  and  subspace  identification  for  a  given  set  of  data 
points.  First,  we  define  the  multiple  subspace  segmentation  problem. 

Data  Segmentation  and  Subspace  Identification  Problem.10, 11  Given  the  set  y  =  {y i,  y2,  ■  ■  • ,  Pn}  C 

M.D ,  where  y,,  i  =  1, . . . ,  N,  are  drawn  from  a  set  of  distinct  subspaces  Sj ,  j  =  1, . . . ,  n,  of  unknown  number  and 
dimension,  find  i)  the  number  of  subspaces  n,  ii )  their  dimensions  dim(iSj),  and  Hi)  the  basis  for  each  subspace. 
Furthermore,  associate  each  point  to  the  subspace  it  belongs  to. 

GPCA  uses  algebro-geometric  concepts  to  address  this  problem.  First,  we  present  the  GPCA  algorithm 
followed  by  a  more  robust  version  of  GPCA  to  deal  with  noisy  data.  For  a  detailed  discussion,  see  Refs.  10  and 
11. 

3.1  Generalized  Principal  Component  Analysis 

In  this  subsection,  we  present  the  GPCA  algorithm,  where  we  assume  that  the  data  points  are  noise-free. 
The  GPCA  algorithm  consists  of  three  main  steps;  namely,  polynomial  fitting,  polynomial  differentiation,  and 
polynomial  division.  The  following  definitions  are  needed. 

Definition  3.1. 10,14-16  Let  TZ  be  a  commutative  ring  and  let  X  be  an  additive  subgroup  of  1Z.  X  is  called  an 
ideal  if  r  £  TZ  and  s  £  X,  then  rs  £  X.  Furthermore,  an  ideal  is  said  to  be  generated  by  a  set  S  if,  for  all  t  £  X, 
t  =  X^r=i  risi>  Ti  £IZ,  Si  £  S,  i  =  1, . . .  ,n,  for  some  n  £  Z+.  Let  F[a:]  be  the  set  of  polynomials  of  D  variables, 
where  x  =  [xi,  X2,  ■  ■  ■ ,  Xd]t ,  x,  £  F,  i  =  1, . . . ,  D,  and  F  is  a  field.  Then  F[x],  with  polynomial  addition  and 
multiplication,  forms  a  commutative  ring.  A  product  of  n  variables  X\,X2,  ■  ■  ■  ,xn  is  called  a  monomial  of  degree 
n  (counting  multiplicity) .  The  number  of  distinct  monomials  of  degree  n  is  given  by 

Mn(D)  =  C(D  +  n-l,n),  (11) 

where  C(p,q)  denotes  the  combination  of  p  objects  taken  q  at  a  time.  A  polynomial  with  all  of  its  terms  being 
the  same  degree  is  called  a  homogenous  polynomial.  An  ideal  that  is  generated  by  homogenous  polynomials  is 
called  a  homogenous  ideal.  Finally,  the  Veronese  Map  of  degree  n  is  a  mapping  un  :  FD  — *  FMn^D^  that  assigns 
to  the  variables  Xi,X2 Xd  all  the  possible  monomials  of  degree  n;  namely, 

Vn([x  1,  X2,  .  •  -  ,  Xd]T)  =  [til,  U2,  ...,  UMn(D)]T,  (12) 

where  Ui  =  x^xlf'2  ■  ■  ■  xf)D ,  i  =  1, . . . ,  Mn(D),  and  where  nn  +  na  +  ■  ■  ■  +  nto  =  n,  nij  £  Z+;  j  =  1, . . . ,  D, 
and  Un , . . . ,  Uid  are  in  lexicographic  order. 


Let  A  =  {1S1, <S2,  •  ■  ■ , iS„},  Za  =  5i  U  ^2  U  •  •  •  U  Sn,  where  Sj  C  KD,  j  =  1 ,n,  is  a  linear  subspace, 
dim(5'j)  =  dj,  and  n  £  Z+.  A  is  referred  to  as  a  subspace  arrangement.  In  addition,  let  y  =  {yi,  t/2 ,  •  •  • ,  Vn}  C 
be  a  set  of  a  sufficiently  large  number  of  points  sampled  from  Za-  In  this  paper,  we  assume  that  the  number 
of  subspaces  n  is  known.  However,  the  GPCA  algorithm,  in  its  most  general  form,  gives  the  solution  for  the 
case  where  n  is  unknown.10,11  In  order  to  algebraically  represent  Z_ 4,  we  need  to  find  the  vanishing  ideal  of 
Za,  denoted  by  T(fZjf).  The  vanishing  ideal  of  Za  is  the  set  of  polynomials  which  vanish  on  Za-  It  can  be 
shown  that  the  homogenous  component  of  I(Za),  denoted  by  Tn ,  uniquely  determines  I(Za).  Hence,  to  find 
the  vanishing  ideal  I(Za)  it  suffices  to  determine  the  homogenous  component  In. 

If  pn(x)  is  a  polynomial  in  Tn,  then  pn(x)  =  cf)vn(x),  where  cn  £  vn(x)  :  F11  — »  WMAD)  is  the 

Veronese  map  given  by  (12),  x  =  [aq,  X2,  ■  ■  ■ ,  Xd]t  for  some  D  £  Z+,  and  Mn(D )  is  given  by  (11).  Therefore, 
each  point  yi,  i  =  1, . . . ,  N,  satisfies  p„.(x)  =  0,  and  hence,  Vn(D)cn  =  0,  where 


Vn(D)  4 


Vn(yi) 

vniy2) 


vI(Vn) 


(13) 


is  called  the  embedded  data  matrix.  A  one-to-one  correspondence  between  the  null  space  of  Vn{D)  and  the 
polynomials  in  In  exists  if 


dim(7V‘(I4  (£>)))  =  dim(J„)  =  hx(n), 


(14) 


or,  equivalently, 


rank  14(D)  =  Mn(D)  -  hx(n),  (15) 

where  the  Hilbert  function  hi(n)  is  the  number  of  linearly  independent  polynomials  of  degree  n  that  vanish  on 
Za-  The  singular  vectors  of  14(D)  denoted  by  cnj  £  i  =  1, . . . ,  hx(n),  corresponding  to  the  zero  singular 

values  of  14(D)  can  be  used  to  compute  a  basis  for  In,  namely 

ln  =  span{prei(a:)  =  cnivn{ x),  i  =  1, . . . ,  hx(n)}. 

In  the  case  where  the  data  set  y  is  corrupted  by  noise,  the  singular  vectors  corresponding  to  the  hx(n)  smallest 
singular  values  of  Vn  (D)  are  used  to  compute  the  basis  for  Tn . 

The  following  theorem  shows  how  polynomial  differentiation  can  be  used  to  find  the  dimensions  and  bases  of 
each  subspace  Sj,  j  =  1 , ...  ,n. 

Theorem  3. 2. 10,11  Let.  y  =  {y\,  2/2,  •  •  • ,  2hv}  C  R11  be  a  set  of  points  sampled  from  Za  =  <Si  U  £2  U  •  •  •  U  Sn, 
where,  for  j  =  1, . . .  ,n,  Sj  is  a  subspace  of  unknown  dimension  dj.  Furthermore,  assume  that  for  every  subspace 
Sj,  j  =  1, . . .  ,n,  there  exists  Wj  £  Sj  such  that  Wj  0  Si,  i  ^  j,  i  =  1, . . .  ,n,  and  condition  (If)  holds.  Then 

,  j  =  1, . . .  ,n,  (16) 

where  14  (D)  is  the  embedded  data  matrix  of  y  given  by  (13).  Furthermore,  dj  =  D  —  rank X/Pn(wj),  j  = 
1  where  Pn(x)  =  (pni(x),  pnxix),  . . .  ,pnhx(n)(x)]T  €  a  row  vector  of  independent  polyno¬ 

mials  in  Tn,  composed  of  the  singular  vectors  corresponding  to  the  zero  singular  values  ofVn(D),  and  VP„  = 
[VVniOr),  \tTPn2(x),  . . . ,  VTpnhx(n)(x)\  GRDxh^n\ 

To  select  a  point  Wj,  j  =  1, . . . ,  n,  for  each  subspace  such  that  Wj  £  Sj,  Wj  $  Si,  i  ^  j ,  i  =  1, ... ,  n,  without 
loss  of  generality,  let  j  =  n.  One  can  show  that  the  first  point  wn ,  where  wn  €  Sn  and  wn  0  S,,  i  =  1, . . . ,  n  —  1, 
is  given  by10,11 


Sj  —  span 


\<9a; 


D; {x)\x—w.  :  cn  £  Af  (14  (D)) 


Wn  = 


argmin^-y.  WPn{y)¥l0Pn  (y)  (VTP„  (y) VP„  {y))^Pj(y). 


(17) 


Furthermore,  a  basis  for  Sn  can  be  found  by  applying  PCA  to  VP„(u>n).  To  find  the  rest  of  the  points  Wi  G  Si, 
i  =  1, . . .  ,n  —  1,  we  can  use  polynomial  division  as  outlined  in  the  next  theorem. 

Theorem  3. 3. 10,11  Let  y  =  {y±,  2/2, ,  un}  C  R13  be  a  set  of  points  sampled  from  Z .4  =  <Si  U  S2  U  •  •  •  U  Sn, 
where,  for  j  =  1  Sj  is  a  subspace  of  unknown  dimension  dj.  Assume  (If)  holds,  Sf  is  known,  and  a 

point  wn  G  Sn  is  given.  Then,  the  set  Uj=i  Sj  characterized  by  the  set  of  homogenous  polynomials  given  by 

^c](_1un_1(x)  :  Vn(D)Rn(bn)cn-i  =  0,  bn  G  S„,  cn_  1  G  KMn-l(£l)| , 

where  Rn(bn)  G  ^M".(D)xMnSD)  js  iye  matrix  of  coefficients  of  cn- 1  when  (bflx){c^l_1vn- 1(2;))  =  cfvn(x)  is 
rearranged  to  have  the  form  Rn(bn)cn- 1  =  cn. 

Once  the  homogenous  polynomials  {cfl_1vn- 1(2:)}  given  by  Theorem  3.3  are  obtained,  an  identical  procedure 
can  be  repeated  to  find  wn- 1  and  the  homogenous  polynomials  characterizing  \J™=1Sj. 

3.2  Subspace  Estimation  Using  a  Voting  Scheme 

The  GPCA  framework  given  in  Subsection  3.1  works  well  in  the  absence  of  noise.  In  practice,  however,  noise  is 
always  present  and  efficient  statistical  methods  need  to  be  incorporated  with  the  GPCA.  In  this  subsection,  we 
present  one  such  statistical  method  where  a  voting  scheme  is  combined  with  the  GPCA.  Here,  we  assume  that 
the  number  of  the  subspaces  and  their  dimensions  are  known.  For  details,  see  Refs.  10  and  11. 

Let  y  =  { yi ,  2/2,  •  •  • ,  Pn}  C  R13  be  the  set  of  data  points  sampled  from  the  set  Z4  =  Si  U  S2  U  •  •  •  U  Sn, 
where,  for  j  =  1, . . . ,  n,  Sj  is  a  subspace  of  dimension  dj  and  co-dimension  Cj  =  D  —  dj.  As  noted  in  Subsection 
3.1,  the  homogenous  component  of  degree  n  of  the  vanishing  ideal  I(Zj£),  denoted  by  In,  uniquely  defines 
I(Z a)  and  dim(Z„)  =  hx(n),  where  hj(n)  is  the  Hilbert  function.  Let  P  =  {pi{x),  P2(x),  ...,  phxin)(x)}  be 
the  set  of  polynomials  forming  a  basis  for  Xn.  This  set  is  obtained  by  selecting  the  hx(n)  smallest  singular 
values  of  Vn(D),  where  Vn(D)  is  the  embedded  data  matrix  given  by  (13).  Let  y-\  G  A  and  define  VPb(2/i)  — 
[VTpi(j/i),  VTp2(yi),  ■  ■  • ,  VT Phx(n){yi)\  ■  Note  that  in  the  noise-free  case,  if  y\  G  Sj,  then  rank  VPb(j/i)  =  Cj. 

In  the  case  where  the  data  is  corrupted  by  noise,  a  more  efficient  method  for  computing  the  basis  is  desired. 
Suppose  the  co-dimension  of  the  subspaces  <Si,  S2,  ■  ■  ■ ,  S„  take  q  distinct  values  cf  c'2,  ■  ■  ■ ,  c'q,  respectively.  Given 
the  fact  that  the  membership  of  y±  to  one  of  the  subspaces  Sj,  j  =  1, . . . ,  n,  is  unknown,  a  set  of  basis  candidates 
for  the  orthogonal  complement  of  subspaces  of  all  possible  dimensions  c',  i  =  1 , ...  ,q,  is  calculated  by  choosing 
the  c'  principal  components  of  VPb(i/i).  This  results  in  q  matrices  Bi  G  Rr>xcy  i  =  1 ,...  ,q,  each  of  which  is  a 
basis  candidate  for  Sf ,  i  =  1, . . .  ,n. 

The  main  idea  of  the  voting  scheme  is  to  count  the  number  of  repetitions  of  each  basis  candidate  for  all  points 
in  the  data  set  y  =  {yi, . . .,  pn}-  The  n  basis  candidates  with  the  most  votes  are  chosen  to  be  the  basis  for  S)~, 
i  =  1, . . . ,  n,  and  each  point  is  assigned  to  its  closest  subspace.  In  our  criterion  for  counting  the  repetition  of  the 
basis  candidates,  two  basis  candidates  are  considered  to  be  the  same  if  the  angle  between  the  subspaces  spanned 
by  them  is  less  than  r,  where  r  >  0  is  a  given  tolerance  parameter. 

4.  UNSUPERVISED  LEARNING  OF  FACIAL  EXPRESSIONS 

The  MVU  and  GPCA  methods  presented  in  Sections  2  and  3  can  be  used  to  recognize  facial  expressions  from 
a  given  set  of  images  within  an  unsupervised  learning  framework.  Specifically,  given  a  set  of  images  of  a  person 
with  two  different  facial  expressions  (e.g.,  neutral  and  happy),  we  show  that  the  two  facial  expressions  reside  on 
two  distinct  subspaces  if  the  manifold  is  unfolded.  In  particular,  semi-definite  embedding  is  used  to  reduce  the 
dimensionality  and  unfold  the  manifold  of  facial  images.  Next,  generalized  principal  component  analysis  is  used 
to  fit  a  series  of  subspaces  to  the  data  points  and  associate  each  data  point  to  a  subspace.  Data  points  that 
belong  to  the  same  subspace  are  claimed  to  belong  to  the  same  facial  expression.  The  algorithm  is  summarized 
in  Table  1. 

In  our  experiment,  30  photographs  were  taken  for  each  human  subject,  where  the  subject  starts  by  a  neutral 
expression,  transitions  to  a  happy  expression,  and  goes  back  to  a  neutral  expression  with  each  part  containing 


Table  1.  Facial  Expression  Recognition  Algorithm 


Step  1.  Preprocess  of  the  grayscale  image  data  Ii,. ■ .,  In- 

a.  Compute  Xj  £  RD  by  column  stacking  the  matrix  of  Ij,  j  =  1, . . . ,  N. 

b.  Set  the  number  of  neighbors  k. 

c.  Form  the  weighted  graph  0. 

d.  Form  the  modified  graph  0'  and  the  adjacency  matrix  rj  = 

Step  2.  Manifold  unfolding  and  dimension  reduction. 

a.  Set  the  dimension  of  the  reduced  space  D. 

b.  Find  I\* ,  the  maximizer  of  (7)  subject  to  (8)— (10) . 

c.  Compute  the  eigenvectors  and  the  associated  eigenvalues  of  A'*;  denote  by 
Vj  =  [Vji,  Vj2,  ---,  VjD,]t  and  Xj,  j  =  l,...,N. 

d.  Reorder  Vj  and  A  j  such  that  Xj,  j  =  1, ...  ,N  are  in  decreasing  order. 

e.  Compute  yji  =  y/XjVji,  j  =  1,...,N,  i  =  l,...,D'. 

f.  Compute  yj  =  [yji,  yj2,  --■,  VjD')T,  j  =  l,...,N. 

g.  Compute  yfd  =  [yj  i,  . . . ,  yjD}T,  j  =  1,...,N. 

h.  Compute  >’  =  { y\eA ,  . . . ,  y^d}. 

Step  3.  Subspace  estimation  using  a  voting  scheme. 

a.  Set  the  subspace  angle  tolerance  parameter  r  >  0. 

b.  For  the  subspaces  <Si,  £2,  •  ■ . ,  Sn,  compute  the  distinct  value  of  their  co-dimension 

G )  ^2  5  *  *  *  ?  Cq  - 

c.  Initialize  the  arrays  Hi  =  [],...,  Hq  =  []  and  the  counters  Ci  =  Cq  =  [] 

d.  for  j  =  1  :  N 

e.  for  i  =  1  :  q 

f.  Compute  the  c'  principal  components  of  VPB(j/)ed)- 

g.  Form  the  orthogonal  matrix  Bi  £  R.£>xci  using  outputs  of  Step  3f. 

h.  if  there  exists  k  such  that  the  angle  ZsubsPac e(Bi,Ui(k))  <  r,  then 

i.  Ci(k)  <—  Ci(k)  +  1. 

j.  else 

k.  Ui  [Uu  Bi}. 

l.  Ci  v-  [Ci,  1]. 

m.  end  for 

n.  end  for 

o.  Select  basis  candidates  from  U\,  . . .  ,Uq  corresponding  to  the  n  highest  values  of 
Ci,  ...  ,Cq.  Denote  basis  by  B\, . . . ,  Bn. 

p.  Use  B 1, . . . ,  Bn  (basis  for  S^,  . . . ,  S^)  to  find  the  B[} . . . ,  B'n  (basis  for  Si,  . . . ,  Sn). 

q.  Use  results  in  Step  3p  to  assign  each  y[eA, . . . ,  y)|d,  to  the  closest  subspace. 

10  photographs.  An  example  set  of  images  is  given  in  Figure  2.  These  images  were  taken  in  a  sequence,  each 
200  x  240  pixels,  and  in  total  there  were  4  subjects. 

Each  image  can  be  considered  as  a  vector  of  dimension  48000  by  column  stacking  the  grey  scale  image  intensity 
matrix.  In  this  case,  each  image  is  a  point  in  a  48000-dimension  space.  In  order  to  segment  the  images,  the 
dimension  is  reduced  to  D  =  5  using  the  MYU  algorithm.  Then,  the  resulting  points  in  the  D  =  5-dimensional 
ambient  space  are  used  to  identify  2  subspaces  of  dimension  d  =  1,2,  3, 4,  where  in  the  GPCA  voting  algorithm 
two  subspaces  are  considered  to  be  the  same  if  the  angle  between  the  two  subspaces  is  less  than  r  =  0.4. 17  The 
segmentation  error  for  each  case  is  given  in  Table  2,  where  it  is  noted  that  the  best  results  correspond  to  d  =  3 
and  4.  In  order  to  visualize  the  subspace  identification,  the  segmentation  for  the  case  D  =  2  and  d  =  1  is  given 
in  Figure  3.  Although  these  parameters  result  in  a  poor  segmentation  performance,  it  graphically  conveys  the 
main  idea  of  the  algorithm. 
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Figure  2.  A  sequence  of  pictures,  where  the  subject  starts  with  a  neutral  expression,  smiles,  and  resumes  to  a  neutral 
expression. 

Table  2.  Segmentation  Results  for  D  =  5 


Subject 

Number  of  Images 

d  = 

Segmentation  Error 

1  d= 2  d= 3  d= 4 

1 

29 

3 

2 

2 

3 

2 

31 

13 

13 

3 

7 

3 

31 

6 

15 

2 

4 

4 

32 

13 

15 

1 

1 

5.  CONCLUSION 

In  this  paper,  we  considered  facial  expression  recognition  within  an  unsupervised  learning  framework.  Specifically, 
given  a  data  set  composed  of  a  number  of  facial  images  of  the  same  subject  with  different  facial  expressions,  the 
algorithm  introduced  in  this  paper  segments  the  data  set  into  groups  corresponding  to  different  facial  expressions. 
Each  facial  image  can  be  regarded  as  a  point  in  a  high-dimensional  space,  and  the  collection  of  images  of  the 
same  subject  resides  on  a  manifold  within  this  space.  Our  results  show  that  different  facial  expressions  reside 
on  distinct  subspaces  if  the  manifold  is  unfolded.  In  particular,  semi-definite  embedding  was  used  to  reduce  the 
dimensionality  and  unfold  the  manifold  of  facial  images.  Generalized  principal  component  analysis  was  used  to 
fit  a  series  of  subspaces  to  the  data  points  and  associate  each  data  point  to  a  subspace.  Data  points  that  belong 
to  the  same  subspace  were  shown  to  belong  to  the  same  facial  expression.  In  future  research  we  will  extend  the 
results  to  an  unknown  number  of  different  facial  expressions,  and  apply  the  framework  to  the  problem  of  facial 
expression  recognition  for  video  imaging. 
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Figure  3.  Facial  expression  segmentation  with  D  =  2  and  d  =  1.  The  categorization  error  is  6/30.  The  solid  and  dashed 
lines  are  the  subspaces  corresponding  to  the  neutral  and  happy  expressions,  respectively.  The  points  associated  with  the 
solid  line  and  the  dashed  line  are  depicted  by  “+”  and  “x”,  respectively.  The  points  depicted  by  “o”  are  points  associated 
with  the  wrong  expression.  Note  that  although  these  parameters  result  in  a  poor  segmentation  performance,  it  graphically 
conveys  the  main  idea  of  the  algorithm. 
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